introduction

In this lab, we’ll explore the basics of spatial and geometry operations on vector data in R using the sf package. The following materials are modified from Chapter 4 and Chapter 5of Geocomputation with R by Rovin Lovelace.

Vector Operations

# Looking at initial datasets to be used
nz
## Simple feature collection with 16 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 1090144 ymin: 4748537 xmax: 2089533 ymax: 6191874
## Projected CRS: NZGD2000 / New Zealand Transverse Mercator 2000
## First 10 features:
##                 Name Island Land_area Population Median_income Sex_ratio
## 1          Northland  North 12500.561     175500         23400 0.9424532
## 2           Auckland  North  4941.573    1657200         29600 0.9442858
## 3            Waikato  North 23900.036     460100         27900 0.9520500
## 4      Bay of Plenty  North 12071.145     299900         26200 0.9280391
## 5           Gisborne  North  8385.827      48500         24400 0.9349734
## 6        Hawke's Bay  North 14137.524     164000         26100 0.9238375
## 7           Taranaki  North  7254.480     118000         29100 0.9569363
## 8  Manawatu-Wanganui  North 22220.608     234500         25000 0.9387734
## 9         Wellington  North  8048.553     513900         32700 0.9335524
## 10        West Coast  South 23245.456      32400         26900 1.0139072
##                              geom
## 1  MULTIPOLYGON (((1745493 600...
## 2  MULTIPOLYGON (((1803822 590...
## 3  MULTIPOLYGON (((1860345 585...
## 4  MULTIPOLYGON (((2049387 583...
## 5  MULTIPOLYGON (((2024489 567...
## 6  MULTIPOLYGON (((2024489 567...
## 7  MULTIPOLYGON (((1740438 571...
## 8  MULTIPOLYGON (((1866732 566...
## 9  MULTIPOLYGON (((1881590 548...
## 10 MULTIPOLYGON (((1557042 531...
nz_height
## Simple feature collection with 101 features and 2 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 1204143 ymin: 5048309 xmax: 1822492 ymax: 5650492
## Projected CRS: NZGD2000 / New Zealand Transverse Mercator 2000
## First 10 features:
##    t50_fid elevation                geometry
## 1  2353944      2723 POINT (1204143 5049971)
## 2  2354404      2820 POINT (1234725 5048309)
## 3  2354405      2830 POINT (1235915 5048745)
## 4  2369113      3033 POINT (1259702 5076570)
## 5  2362630      2749 POINT (1378170 5158491)
## 6  2362814      2822 POINT (1389460 5168749)
## 7  2362817      2778 POINT (1390166 5169466)
## 8  2363991      3004 POINT (1372357 5172729)
## 9  2363993      3114 POINT (1372062 5173236)
## 10 2363994      2882 POINT (1372810 5173419)

Spatial Subsetting

Spatial subsetting is the process of converting a spatial object into a new object containing only the features that relate in space to another object. This is analogous the attribute subsetting that we covered last week. There are many ways to spatially subset in R, so we will explore a few.

Let’s start by going back to the New Zealand datasets and find all the high points in the state of Canterbury.

#filtering to only select canterbury
canterbury <- nz %>% 
  filter(Name == "Canterbury")

#this gives us all the points that fall inside of canterbury
c_height <- nz_height[canterbury,]

#making a map to check
tm_shape(nz) +
  tm_polygons() +
  tm_shape(nz_height)+
  tm_dots(col = "red")

#only mapping the points that fall into canterbury
tm_shape(nz) +
  tm_polygons() +
  tm_shape(c_height) +
  tm_dots(col = "red")

# The default is to subset to features that intersect, but we can use other operations, including finding features that do not intersect.
#creating for opposite, using st_dis
outside_height <- nz_height[canterbury, , op = st_disjoint]

#map it
tm_shape(nz) +
  tm_polygons() +
  tm_shape(outside_height) +
  tm_dots(col = "red")

#We can perform the same operations using topological operators. These operators return matrices testing the relationship between features.
#st_intersects (if it touches at all, include it)

sel_sgbp <- st_intersects(x = nz_height, y = canterbury)

#turning this into a logical (anything greater than 0 are the rows we want because that means they intersect it)
sel_logical <- lengths(sel_sgbp) > 0

c_height2 <-nz_height[sel_logical, ]
tm_shape(nz) +
  tm_polygons() +
  tm_shape(c_height2) +
  tm_dots(col = "red")

Points in Canterbury Again We can also use the st_filter function in sf

#Can do the same thing with tidy commands using st_filter
c_height3 <- nz_height %>% 
  st_filter(y=canterbury, .predicate = st_intersects) #filter using a spatial intersection

tm_shape(nz) +
  tm_polygons() +
  tm_shape(c_height3) +
  tm_dots(col = "red")

We can change the predicate option to test subset to features that don’t intersect

#We can change the predicate option to test subset to features that don’t intersect
outside_height10 <- nz_height %>% 
  st_filter(y=canterbury, .predicate = st_disjoint)

tm_shape(nz) +
  tm_polygons() +
  tm_shape(outside_height10) +
  tm_dots(col = "red")

Creating Bounding Box

Spatial joining

Where attribute joining depends on both data sets sharing a ‘key’ variable, spatial joining uses the same concept but depends on spatial relationships between data sets.

Let’s test this out by creating 50 points randomly distributed across the world and finding out what countries they call in.

#creating bounding box

bb <- st_bbox(world)

#setting random points within the dataframe
random_df <- data.frame(
  x=runif(n=10,min = bb[1], max = bb[3]),
  y=runif(n=10,min = bb[2], max = bb[4])
)

#turning a data frame into an sf object
random_points <- random_df %>% 
  st_as_sf(coords = c("x", "y")) %>% 
  st_set_crs("EPSG:4326")

#checking that random points is now an sf object
class(random_points)
## [1] "sf"         "data.frame"
#mapping this
tm_shape(world) +
  tm_fill() +
  tm_shape(random_points) +
  tm_dots(col = "red")

#figuring out what countries have the points in them
world_random <- world[random_points, ]
tm_shape(world_random) +
  tm_fill() #just filling in shapes that have points in them

#Practicing Using World Random again 
#Let’s first use spatial subsetting to find just the countries that contain random points.
bb <- st_bbox(world)

random_df <- data.frame(
  x = runif(n = 10, min = bb[1], max = bb[3]), #bounding box for the x is 1 and 3
  y = runif(n=10, min = bb[2], max = bb[4]) #bounding box for the y is 2 and 4
) #this will create a datframe with 10 values with x and y objects 


world_random <- world[random_points,] #this is how to spatially subset using the rows which align in each data frame provided

random_points <- random_df %>%
  st_as_sf(coords = c("x", "y")) %>% #c with the name of the x and y values
  st_set_crs("EPSG:4326") #setting the coordinate reference system 

tm_shape(world) +
  tm_fill() +
  tm_shape(random_points) +
  tm_dots(col = "red")

#Now let’s perform a spatial join to add the info from each country that a point falls into onto the point dataset.
random_joined <- st_join(random_points, world) #joining points that fall inside the countries 

#By default, st_join performs a left join. We change this and instead perform an inner join.
random_joined_inner <- st_join(random_points, world, left = FALSE) #joining points that fall outside of the countries

Non-Overlapping Joins

Sometimes we might want join geographic datasets that are strongly related, but do not have overlapping geometries. To demonstrate this, let’s look at data on cycle hire points in London.

#Non-overlapping joins 

tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(cycle_hire) +
  tm_dots(col = "blue", alpha = 0.5) + #alpha is changing the transparency 
  tm_shape(cycle_hire_osm) +
  tm_dots(col = "red", alpha = 0.5)
#We can check if any of these points overlap.
any(st_touches(cycle_hire, cycle_hire_osm, sparse = FALSE)) #none of these points actually touch
## [1] FALSE
#Let’s say we need to join the ‘capacity’ variable in ‘cycle_hire_osm’ onto the official ‘target’ data in ‘cycle_hire’. The simplest method is using the topological operator st_is_within_distance().

sel <- st_is_within_distance(cycle_hire, cycle_hire_osm, dist = 20)
summary(lengths(sel) > 0) #summarizes the number of points within 20 meters
##    Mode   FALSE    TRUE 
## logical     304     438
#Now, we’d like to add the values from ‘cycle_hire_osm’ onto the ‘cycle_hire’ points.
z <- st_join(cycle_hire, cycle_hire_osm, st_is_within_distance, dist = 20)
nrow(cycle_hire) #742 rows in original data
## [1] 742
nrow(z) #762 rows in new data
## [1] 762
#Note: the number of rows of the join is larger than the number of rows in the original dataset. Why? Because some points in ‘cycle_hire’ were within 20 meters of multiple points in ‘cycle_hire_osm’. If we wanted to aggregate so we have just one value per original point, we can use the aggregation methods from last week.
z <- z %>% 
  group_by(id) %>% 
  summarize(capacity = mean(capacity))

nrow(z) #now they are the same length
## [1] 742

Spatial Aggregation

Similar to attribute data aggregation, spatial data aggregation condenses data (we end up with fewer rows than we started with).

Let’s say we wanted to find the average height of high point in each region of New Zealand. We could use the aggregate function in base R.

nz_agg <- aggregate(x = nz_height, by = nz, FUN = mean)
#x object is the information we want to aggregate, and we want to aggregate it by the y value, then the FUN indicates the function we want to use to aggregate 

The result of this is an object with the same geometries as the aggregating feature data set (in this case ‘nz’).

We could also use a sf/dplyr approach.

nz_agg <- st_join(nz, nz_height) %>% #joining onto the aggregating dataset so nz is on the left (because that is what we are joining onto)
  group_by(Name) %>% 
  summarize(elevation = mean(elevation, na.rm = TRUE))

joining incongruent layers

We might want to aggregate data to geometries that are not congruent (i.e. their boundaries don’t line up). This causes issues when we think about how to summarize associated values.

head(incongruent)
## Simple feature collection with 6 features and 2 fields
## Attribute-geometry relationship: 0 constant, 1 aggregate, 0 identity, 1 NA's
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 418580 ymin: 443703 xmax: 422963 ymax: 446978
## Projected CRS: OSGB 1936 / British National Grid
##         level    value                       geometry
## 1 Incongruent 4.037919 MULTIPOLYGON (((420799.6 44...
## 2 Incongruent 5.014419 MULTIPOLYGON (((418664 4464...
## 3 Incongruent 4.933000 MULTIPOLYGON (((419964 4462...
## 4 Incongruent 5.120139 MULTIPOLYGON (((420368 4441...
## 5 Incongruent 6.548912 MULTIPOLYGON (((420419.8 44...
## 6 Incongruent 3.749791 MULTIPOLYGON (((421779 4451...
head(aggregating_zones)
## Simple feature collection with 2 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 417686.2 ymin: 443703.6 xmax: 422959.3 ymax: 447036.8
## Projected CRS: OSGB 1936 / British National Grid
##       geo_code geo_label geo_labelw                       geometry
## 5164 E02002332 Leeds 003       <NA> MULTIPOLYGON (((418731.9 44...
## 6631 E02002333 Leeds 004       <NA> MULTIPOLYGON (((419196.4 44...
tm_shape(incongruent) +
  tm_polygons() +
  tm_shape(aggregating_zones) +
  tm_borders(col = "red")

The simplest method for dealing with this is using area weighted spatial interpolation which transfers values from the ‘incongruent’ object to a new column in ‘aggregating_zones’ in proportion with the area of overlap.

iv <- incongruent["value"]
agg_aw <- st_interpolate_aw(iv, aggregating_zones, extensive = TRUE) #aggregating zones are what we are aggregating to
## Warning in st_interpolate_aw.sf(iv, aggregating_zones, extensive = TRUE):
## st_interpolate_aw assumes attributes are constant or uniform over areas of x
tm_shape(agg_aw) +
  tm_fill(col = "value")

distance relationships

While topological relationships are binary (features either intersect or don’t), distance relationships are continuous.

We can use the following to find the distance between the highest point in NZ and the centroid of the Canterbury region.

nz_highest <- nz_height %>% 
  slice_max(n = 1, order_by = elevation) #this will give us the highest point in new zealand


canterbury <- nz %>% 
  filter(Name == "Canterbury")

#finding the centroid of canterbury
canterbury_centroid = st_centroid(canterbury)
## Warning in st_centroid.sf(canterbury): st_centroid assumes attributes are
## constant over geometries of x
#finding the distance
st_distance(nz_highest, canterbury_centroid)
## Units: [m]
##        [,1]
## [1,] 115540
#Note: this function returns distances with units (yay!) and as a matrix, meaning we could find the distance between many locations at once.

Simplification

Simplification generalizes vector data (polygons and lines) to assist with plotting and reducing the amount of memory, disk space, and network bandwidth to handle a dataset.

Let’s try simplifying the US states using the Douglas-Peucker algorithm. GEOS assumes a projected CRS, so we first need to project the data, in this case into the US National Atlas Equal Area (epsg = 2163)

us_states2163 <- st_transform(us_states, "EPSG:2163")
us_states2163 = us_states2163

us_states_simp1 = st_simplify(us_states2163, dTolerance = 100000)

tm_shape(us_states_simp1) +
  tm_polygons()
## Warning: The shape us_states_simp1 contains empty units (after reprojection).
# proportion of points to retain (0-1; default 0.05)
us_states_simp2 = rmapshaper::ms_simplify(us_states2163, keep = 0.01, keep_shapes = TRUE) #this keeps the shapes/connections between the states
## Warning in ms_de_unit(input): Coercing these 'units' columns to class numeric:
## AREA
#mapping this

tm_shape(us_states_simp2) +
  tm_polygons()

Instead of simplifying, we could try smoothing using Gaussian kernel regression.

us_states_simp3 = smoothr::smooth(us_states2163, method = 'ksmooth', smoothness = 6)

tm_shape(us_states_simp3) +
  tm_polygons()

centroids

Centroids identify the center of a spatial feature. Similar to taking an average, there are many ways to compute a centroid. The most common is the geographic centroid.

nz_centroid <- st_centroid(nz)
## Warning in st_centroid.sf(nz): st_centroid assumes attributes are constant over
## geometries of x
tm_shape(nz) +
  tm_fill() +
  tm_shape(nz_centroid) +
  tm_dots()

Sometimes centroids fall outside of the boundaries of the objects they were created from. In the case where we need them to fall inside of the feature, we can use point on surface methods.

nz_pos <- st_point_on_surface(nz)
## Warning in st_point_on_surface.sf(nz): st_point_on_surface assumes attributes
## are constant over geometries of x
tm_shape(nz) +
  tm_fill() +
  tm_shape(nz_centroid) +
  tm_dots() +
  tm_shape(nz_pos) +
  tm_dots(col = "red")

buffers

Buffers create polygons representing a set distance from a feature.

seine_buffer <- st_buffer(seine, dist = 5000)

tm_shape(seine_buffer) +
  tm_polygons()

unions

As we saw in the last lab, we can spatially aggregate without explicitly asking R to do so.

w <- tm_shape(world) +
  tm_polygons()

continents <- world %>%
  group_by(continent) %>%
  summarize(population = sum(pop, na.rm = TRUE))

c <- tm_shape(continents) +
  tm_polygons()

tmap_arrange(w, c)

What is going on here? Behind the scenes, summarize() is using st_union() to dissolve the boundaries.

us_west <- us_states %>%
  filter(REGION == "West")

us_west_union <- st_union(us_west)

tm_shape(us_west_union) +
  tm_polygons()

st_union() can also take 2 geometries and unite them.

texas <-  us_states %>%
  filter(NAME == "Texas")

texas_union = st_union(us_west_union, texas)

tm_shape(texas_union) +
  tm_polygons()